1 Dataset Description

A total of 7880 individuals from 2611 families were genotyped on the Illumina Human1M-Duov3_B or the Human1Mv1_C.

  • 4901 males, 2979 females.
  • 2571 trios, 36 quads, 1 family of five, and 3 families of six.
  • 947,233 SNPs were genotyped.
  • Coordinates were based on Build36.
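
The counts above can be cross-checked with simple arithmetic. The sketch below assumes that a trio, quad, penta and hex contain 3, 4, 5 and 6 genotyped members respectively; the report does not state this explicitly.

```python
# Family-size counts as reported above: members per family -> number of families.
# Assumption: trio = 3, quad = 4, penta = 5, hex = 6 genotyped members.
family_counts = {3: 2571, 4: 36, 5: 1, 6: 3}

n_families = sum(family_counts.values())
n_individuals = sum(size * n for size, n in family_counts.items())
n_by_sex = 4901 + 2979  # males + females

print(n_families, n_individuals, n_by_sex)
```

Under that assumption, the family total and both individual totals agree with the 2611 families and 7880 individuals quoted in the first sentence.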


2 Raw Genotype QC

2.1 Sex Check

  • 141 samples flagged as PROBLEM
    • 115 with complete missing chrX genotypes.
    • 26 with chrX-F ranging from 0.20 to 0.62
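
As a rough illustration of how a sample ends up flagged, the sketch below calls sex from the chrX inbreeding coefficient (F) and compares it with the reported sex. The cutoffs of 0.2 and 0.8 are an assumption (they match the PLINK-style defaults); the report does not state which thresholds were used, and the function names are hypothetical.

```python
import math

def call_sex_from_f(f, female_max=0.2, male_min=0.8):
    """Return 'F', 'M', or None when chrX F is missing or ambiguous.

    Assumed cutoffs: F < 0.2 -> female, F > 0.8 -> male.
    """
    if f is None or math.isnan(f):
        return None  # e.g. completely missing chrX genotypes
    if f < female_max:
        return "F"
    if f > male_min:
        return "M"
    return None  # ambiguous range, e.g. 0.20 to 0.62

def sex_check_status(reported_sex, f):
    """'OK' only when a sex can be called and it matches the reported sex."""
    called = call_sex_from_f(f)
    return "OK" if called is not None and called == reported_sex else "PROBLEM"

print(sex_check_status("F", 0.35))          # ambiguous F
print(sex_check_status("M", float("nan")))  # no chrX genotypes
print(sex_check_status("M", 0.95))          # consistent
```

With these assumed cutoffs, both groups listed above (missing chrX genotypes and intermediate F values) would be flagged regardless of the reported sex.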

2.1.1 Mismatch summary



2.1.2 ChrX F distributions



2.2 Pairwise IBD estimation

  • Relationships (RT): OT (Others), FS (Full Siblings), PO (Parent Offspring)
  • Family ID 483 has a potential issue:
    • inbreeding coefficient = 1 between IID:328 (Female) and IID:1491 (Female)
    • Monozygotic (MZ) twins, or the same individual? The genotype missing rates are 0.1185 and 0.1186 for IID:483_328 and IID:483_1491, respectively. [Drop IID:483_1491, IID:483_4371 and IID:483_993.]
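
For orientation, the relationship types can be related to the estimated IBD sharing probabilities (Z0, Z1, Z2) with a sketch like the one below. The cutoffs and function names are illustrative assumptions, not the thresholds used in this report, and the sketch assumes that the value of 1 reported for the pair in family 483 is a pairwise IBD proportion.

```python
def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1

def classify_pair(z0, z1, z2):
    """Assign a coarse relationship type from IBD probabilities.

    Expected values: PO = (0, 1, 0), FS = (0.25, 0.5, 0.25),
    duplicate/MZ = (0, 0, 1). All cutoffs below are hypothetical.
    """
    p = pi_hat(z1, z2)
    if p > 0.9:
        return "DUP/MZ"
    if 0.4 <= p <= 0.6:
        # Parent-offspring pairs share exactly one allele IBD nearly everywhere.
        return "PO" if z1 > 0.8 else "FS"
    return "OT"

print(classify_pair(0.0, 1.0, 0.0))     # parent-offspring pattern
print(classify_pair(0.25, 0.5, 0.25))   # full-sibling pattern
print(classify_pair(0.0, 0.0, 1.0))     # duplicate or MZ twin pattern
```

A pair with sharing close to 1 cannot be distinguished as MZ twins versus a duplicated sample from genotypes alone, which is why the question is left open above.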

2.2.1 Estimated pairwise IBD distributions



2.2.2 Family 483



2.3 Individual genome-wide heterozygosity

2.3.1 Genome-wide heterozygosity VS missing rates



Note that samples were genotyped on either the Human1M-Duov3_B or the Human1Mv1_C. The genotypes for these individuals are a union of the genotypes from both platforms.

  • In the final merged dataset, 947,233 markers are represented. Of these, 938,130 are represented on the Human1M-Duov3_B, while 858,052 are on the Human1Mv1_C.
  • Because markers absent from an individual's platform are missing in the merged data, the missingness rate is at least 1% for individuals on the Human1M-Duov3_B and at least 9% for individuals genotyped on the Human1Mv1_C.
  • Individuals were checked to have no more than a 5% missingness rate for the platform they were genotyped on.
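
The platform-specific lower bounds on missingness follow directly from the marker counts above. A minimal sketch of the arithmetic, assuming every marker that is absent from an individual's array is counted as missing in the merged set:

```python
total_markers = 947_233
markers_on_platform = {
    "Human1M-Duov3_B": 938_130,
    "Human1Mv1_C": 858_052,
}

# Minimum missingness = fraction of merged markers not present on the array.
min_missingness = {
    name: (total_markers - n) / total_markers
    for name, n in markers_on_platform.items()
}

for name, rate in min_missingness.items():
    print(f"{name}: minimum missingness = {rate:.2%}")
```

The two values come out at roughly 1% and 9%, matching the bounds quoted in the list.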


2.3.2 Genome-wide F VS missing rates



3 Imputation

3.1 Pre-imputation

The imputation pipeline follows that used for the SSC dataset. A total of 7769 individuals and ~784K autosomal and ~22K chrX SNPs were used for imputation.

  • Filters: --geno 0.05 --mind 0.2 --maf 0.01 --hwe 1e-6
    • 111 people removed due to missing genotype data (--mind).
    • Total genotyping rate in remaining samples is 0.914029.
    • 124565 variants removed due to missing genotype data (--geno).
    • 15633 variants removed due to Hardy-Weinberg exact test.
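
The order of the missingness filters matters: the counts above show samples being removed first and variant missingness then being evaluated in the remaining samples. The toy sketch below illustrates that ordering only; the function names and the list-of-lists genotype layout (with None for a missing call) are assumptions for illustration, not the pipeline's actual code.

```python
def missing_rate(calls):
    """Fraction of missing (None) calls in a non-empty list."""
    return sum(c is None for c in calls) / len(calls)

def filter_mind_then_geno(genotypes, mind=0.2, geno=0.05):
    """Apply a per-sample filter, then a per-variant filter.

    genotypes: one list of calls per sample; None marks a missing call.
    Returns (kept sample indices, kept variant indices).
    """
    keep_samples = [i for i, row in enumerate(genotypes)
                    if missing_rate(row) <= mind]
    if not keep_samples:
        return [], []
    n_variants = len(genotypes[0])
    keep_variants = [
        j for j in range(n_variants)
        if missing_rate([genotypes[i][j] for i in keep_samples]) <= geno
    ]
    return keep_samples, keep_variants

toy = [
    [0] * 10,                # no missing calls
    [1] * 9 + [None],        # 10% missing: kept at mind = 0.2
    [None] * 8 + [2, 2],     # 80% missing: removed
]
samples, variants = filter_mind_then_geno(toy)
print(samples, variants)
```

In the toy data the third sample is dropped first, and the last variant is then removed because half of the remaining samples lack a call for it. The same ordering accounts for the sample count in Section 3.1: 7880 - 111 = 7769.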

Note that a liberal threshold of 0.2 was used for the individual genotype missing rate (--mind) for the AGP data, since a large number of individuals have imiss > 0.1. The 111 people removed had imiss ranging from 0.7 to 1. As noted in Section 2.3.1, the samples were combined from two genotype arrays. Before merging the genotypes, the missingness rates were confirmed to be < 0.1.



3.2 After Imputation

3.2.1 Frequency distribution

  • SNPs overlapping between SSC_imputed and HRC (~7.6M SNPs passing filters: --geno 0.05 --maf 0.01 --hwe 1e-6)
  • Frequencies compared on the same allele
  • 0 SNPs with MAF difference > 0.2
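
The comparison above can be sketched as follows. The dictionaries and SNP IDs are made-up toy values, and the function name is hypothetical; the only element taken from the report is the 0.2 difference threshold and the requirement that both frequencies refer to the same allele.

```python
def flag_frequency_differences(freq_a, freq_b, threshold=0.2):
    """Return shared SNP IDs whose allele frequencies differ by more than threshold.

    freq_a, freq_b: dict of SNP ID -> frequency of the SAME allele in each dataset.
    """
    shared = freq_a.keys() & freq_b.keys()
    return sorted(
        snp for snp in shared
        if abs(freq_a[snp] - freq_b[snp]) > threshold
    )

# Toy frequencies, for illustration only.
imputed = {"rs1": 0.10, "rs2": 0.45, "rs3": 0.30}
reference = {"rs1": 0.12, "rs2": 0.05, "rs4": 0.50}

print(flag_frequency_differences(imputed, reference))
```

Only SNPs present in both inputs are compared, so rs3 and rs4 in the toy data are ignored. An empty result corresponds to the "0 SNPs" outcome reported above.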



3.2.2 PCA

  • Project the first 3 PCs based on pruned HapMap3 SNPs onto 1000G
  • Use K-means to calculate distance
  • Assign ancestry based on a posterior probability threshold of 0.9
    • 6548 Europeans (EUR), 625 Americans (AMR), 123 South-Asians (SAS), 99 East-Asians (EAS) and 141 Africans (AFR).
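
The final assignment step can be illustrated with a minimal sketch. It assumes each individual already has a posterior probability per reference population (how those posteriors were derived from the K-means distances is not described here), and it treats the threshold as inclusive, which is also an assumption. The function name and the "UNASSIGNED" label are hypothetical.

```python
def assign_ancestry(posteriors, threshold=0.9):
    """Return the most probable population if its posterior reaches the threshold."""
    label, prob = max(posteriors.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "UNASSIGNED"

print(assign_ancestry({"EUR": 0.95, "AMR": 0.03, "SAS": 0.02}))
print(assign_ancestry({"EUR": 0.60, "AMR": 0.40}))

# Assigned counts as reported above.
assigned = {"EUR": 6548, "AMR": 625, "SAS": 123, "EAS": 99, "AFR": 141}
print(sum(assigned.values()))
```

The five reported groups sum to 7536, which is fewer than the 7769 individuals carried into imputation; the report does not state whether the remainder fell below the threshold.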